3/07/2020

Our project

We decided to do central Colombia, basically because it is where the capital is.

We built a model for the number of confirmed cases using all the others covariates (plus some we created) and we estimated the predictive accuracy of our selected model.

Our cities

We decided to consider as central Colombia the following cities:

  • Bogotà DC,
  • Boyacá,
  • Tolima,
  • Cundinamarca,
  • Meta,
  • Quindío,
  • Valle del Cauca,
  • Risaralda, Celdas,
  • Boyacá,
  • Antioquia,
  • Santander
  • Casanare

Loading the dataset

colombia_covid <- as.data.frame(read_csv("data/datasets_567855_1056808_Casos1.csv"))
colnames(colombia_covid)[5] <- "Atención"
colnames(colombia_covid)[8] <- "Tipo"
# slicing the main dataset
central.colombia.dep <- c("Bogotá D.C.", "Tolima", "Cundinamarca", "Meta", "Boyacá", "Quindío", "Cauca",
    "Valle del Cauca", "Risaralda", "Caldas", "Boyacá", "Antioquia", "Santander", "Casanare")
central.colombia.rows <- which(colombia_covid$`Departamento o Distrito` %in% central.colombia.dep)
colombia_covid <- colombia_covid[central.colombia.rows, ]

Description of variables

ID de caso: ID of the confirmed case.

Fecha de diagnóstico: Date in which the disease was diagnosed.

Ciudad de ubicación: City where the case was diagnosed.

Departamento o Distrito: Department or distric where the city belongs to.

Atención: Situation of the pacient: recovered, at home, at the hospital, at the ICU or deceased.

Edad: Age of the confirmed case.

Sexo: Sex of the confirmed cade.

Tipo: How the person got infected: in Colombia, abroad or unknown.

País de procedencia: Country of origin if the person got infected abroad.

Map

Here we can see our selected cities. The color of the pins is related with the number of cases: if they are less than \(10\) the color is “green”, if they are less than \(100\) the color is “orange”, otherwise it is “red”.

Preprocessing

We had to clean the dataset:

  • We transformed the Fecha de diagnóstico variable into a Date variable,

  • we fixed the variable Id de caso (some rows were missing so the numbers weren’t consecutive),

  • we created a variable Grupo de edad,

  • we transformed the variables Grupo de edad, Departamento o Distrito, Ciudad de ubicación, Sexo, Atención, Tipo into factors,

  • we cleaned the column País de procedencia and created the variable Continente de procedencia.

##   ID de caso Fecha de diagnóstico Ciudad de ubicación Departamento o Distrito
## 1          1           2020-03-06              Bogotá             Bogotá D.C.
## 2          2           2020-03-09                Buga         Valle del Cauca
## 3          3           2020-03-09            Medellín               Antioquia
## 4          4           2020-03-11            Medellín               Antioquia
## 5          5           2020-03-11            Medellín               Antioquia
## 6          6           2020-03-11              Itagüí               Antioquia
##     Atención Edad Sexo        Tipo País de procedencia Grupo de edad
## 1 Recuperado   19    F   Importado              Italia         19_30
## 2 Recuperado   34    M   Importado              España         31_45
## 3 Recuperado   50    F   Importado              España         46_60
## 4 Recuperado   55    M Relacionado            Colombia         46_60
## 5 Recuperado   25    M Relacionado            Colombia         19_30
## 6       Casa   27    F Relacionado            Colombia         19_30
##   Continente de procedencia
## 1                    Europa
## 2                    Europa
## 3                    Europa
## 4                  Colombia
## 5                  Colombia
## 6                  Colombia

New datasets

New datasets part2

Exploring the dataset

Number of cases confirmed day by day

Other plots

Here the growth seems exponential (and this is consistent with the fact that we are studying the early stages of the outbreak).

brks <- seq(-250, 250, 50)
lbls <- as.character(c(seq(-250, 0, 50), seq(50, 250, 50)))

ggplot(data=colombia_covid, aes(x=`Departamento o Distrito`, fill = Sexo)) +  
                              geom_bar(data = subset(colombia_covid, Sexo == "F")) +
                              geom_bar(data = subset(colombia_covid, Sexo == "M"), aes(y=..count..*(-1))) + 
                              scale_y_continuous(breaks = brks,
                                               labels = lbls) + 
                              coord_flip() +  
                              labs(title="Spread of the desease across genders",
                                   y = "Number of cases",
                                   x = "Department",
                                   fill = "Gender") +
                              theme_tufte() +  
                              theme(plot.title = element_text(hjust = .5), 
                                    axis.ticks = element_blank()) +   
                              scale_fill_brewer(palette = "Dark3")  

#compute percentage so that we can label more precisely the pie chart
age_groups_pie <- colombia_covid %>% 
  group_by(`Grupo de edad`) %>%
  count() %>%
  ungroup() %>%
  mutate(per=`n`/sum(`n`)) %>% 
  arrange(desc(`Grupo de edad`))
age_groups_pie$label <- scales::percent(age_groups_pie$per)

age_pie <- ggplot(age_groups_pie, aes(x = "", y = per, fill = factor(`Grupo de edad`))) + 
  geom_bar(stat="identity", width = 1) +
  theme(axis.line = element_blank(), 
        plot.title = element_text(hjust=0.5)) + 
  labs(fill="Age groups", 
       x=NULL, 
       y=NULL, 
       title="Distribution of the desease across ages") +
  coord_polar(theta = "y") +
  #geom_text(aes(x=1, y = cumsum(per) - per/2, label=label))
  geom_label_repel(aes(x=1, y=cumsum(per) - per/2, label=label), size=3, show.legend = F, nudge_x = 0) +
  guides(fill = guide_legend(title = "Group"))
  
age_pie 

Age-Sex plot

Tipo plot

theme_set(theme_classic())

ggplot(colombia_covid, aes(x = `Fecha de diagnóstico`)) +
  scale_fill_brewer(palette = "Set3") +
  geom_histogram(aes(fill=Tipo), width = 0.8, stat="count") +
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
  labs(title = "Daily number of confirmed cases", 
       subtitle = "subdivided across type",
       x = "Date of confirmation",
       fill = "Type")

Tipo

I think that en estudio means that it is not clear while the case is imported or not, however it seems like there are more imported cases, we can count them:

type_pie <- colombia_covid %>% 
  group_by(Tipo) %>%
  count() %>%
  ungroup() %>%
  mutate(per=`n`/sum(`n`)) %>% 
  arrange(desc(Tipo))
type_pie$label <- scales::percent(type_pie$per)
type_pie<-type_pie[names(type_pie)!="per"]
colnames(type_pie)<-c("Tipo", "Total number", "Percentage")
type_pie
## # A tibble: 3 x 3
##   Tipo        `Total number` Percentage
##   <fct>                <int> <chr>     
## 1 Relacionado            291 29.3%     
## 2 Importado              467 47.0%     
## 3 En estudio             235 23.7%

The frequentist approach

Poisson with Elapsed time as predictor

## [1] "Estimated overdispersion 11.0591714112992"
## [1] "AIC: 446.827929331767"
## [1] "Null deviance:  7859.81"   "Residual deviance: 280.62"

Angela’s attempt

## 
## Call:
## glm(formula = `Cumulative cases/Department` ~ `Elapsed time`, 
##     family = poisson, data = cases_relev_dep)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -13.9994   -4.7902   -2.1236    0.8586   16.3553  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.982659   0.062709   15.67   <2e-16 ***
## `Elapsed time` 0.167306   0.002779   60.20   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 8732.6  on 82  degrees of freedom
## Residual deviance: 3855.1  on 81  degrees of freedom
## AIC: 4285.4
## 
## Number of Fisher Scoring iterations: 5

New

## [1] "AIC: 317.731647373115"
## [1] "Null deviance:  870.39"    "Residual deviance: 191.78"

Poisson with time plus gender

poisson2 <- glm(`Cumulative cases` ~ `Elapsed time` + Sexo_M, data=data1, family=poisson)
par(mfrow=c(2,2))
plot(poisson2)

paste("AIC:", poisson2$aic)
## [1] "AIC: 448.106201285854"
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(poisson2$null.deviance, deviance(poisson2)), 2))
## [1] "Null deviance:  7859.81"  "Residual deviance: 279.9"

Poisson with Elapsed time plus Group de edad

poisson3 <- glm(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+`, data=data1, family=poisson)
par(mfrow=c(2,2))
plot(poisson3)

pred.pois3 <- poisson3$fitted.values
res.st3 <- (data1$`Cumulative cases` - pred.pois3)/sqrt(pred.pois3)
#n=25 
#k=7
#n-k=18
print(paste("Estimated overdispersion", est.overdispersion <- sum(res.st3^2)/18))
## [1] "Estimated overdispersion 10.001532070592"
paste("AIC:", poisson3$aic)
## [1] "AIC: 376.950305422531"
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(poisson3$null.deviance, deviance(poisson3)), 2))
## [1] "Null deviance:  7859.81"   "Residual deviance: 200.75"

Poisson with Elapsed time, Age and ’Departments` as predictors

#Running poisson with the variable representing time, age and departments as predictors
poisson4 <- glm(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` + `Departamento o Distrito_Valle del Cauca`, data=data1, family=poisson)
par(mfrow=c(2,2))
plot(poisson4)

pred.pois4 <- poisson4$fitted.values
res.st4 <- (data1$`Cumulative cases` - pred.pois4)/sqrt(pred.pois4)
#n=25 
#k=19
#n-k=6
print(paste("Estimated overdispersion", est.overdispersion <- sum(res.st4^2)/6))
## [1] "Estimated overdispersion 0.516840130836562"
paste("AIC:", poisson4$aic)
## [1] "AIC: 203.506208494973"
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(poisson4$null.deviance, deviance(poisson4)), 2))
## [1] "Null deviance:  7859.81" "Residual deviance: 3.3"

Poisson with Elapsed time, Age, Departments and Continent of origin as predictors

#poisson5 <- glm(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.`+`Departamento o Distrito_Boyacá`+`Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare`+`Departamento o Distrito_Cauca`+`Departamento o Distrito_Cundinamarca`+`Departamento o Distrito_Meta`+`Departamento o Distrito_Quindío`+`Departamento o Distrito_Risaralda`+`Departamento o Distrito_Santander`+`Departamento o Distrito_Tolima`+`Departamento o Distrito_Valle del Cauca`+`Continente de procedencia_Asia`+`Continente de procedencia_Centroamérica`+`Continente de procedencia_Colombia`+`Continente de procedencia_Europa`+`Continente de procedencia_Norteamérica` + `Continente de procedencia_Sudamerica`, data=data1, family=poisson)
#par(mfrow=c(2,2))
#plot(poisson5)
#paste("AIC:", poisson5$aic)
#paste(c("Null deviance: ", "Residual deviance:"),
#       round(c(poisson5$null.deviance, deviance(poisson5)), 2))

ANOVA to compare the Poisson models

#anova(poisson1, poisson3, poisson4, poisson5, test="Chisq")
anova(poisson1, poisson3, poisson4, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: `Cumulative cases` ~ `Elapsed time`
## Model 2: `Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + 
##     `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + 
##     `Grupo de edad_76+`
## Model 3: `Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + 
##     `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + 
##     `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` + 
##     `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + 
##     `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + 
##     `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + 
##     `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + 
##     `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` + 
##     `Departamento o Distrito_Valle del Cauca`
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        23    280.624                          
## 2        18    200.747  5   79.878 8.901e-16 ***
## 3         6      3.303 12  197.444 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Quasi Poisson with Elapsed time as predictor

poisson1quasi <- glm(`Cumulative cases` ~ `Elapsed time`, data=cases, family=quasipoisson)
par(mfrow=c(2,2))
plot(poisson1quasi)

pred.poisq <- poisson1quasi$fitted.values
res.stq <- (data1$`Cumulative cases` - pred.poisq)/sqrt(summary(poisson1quasi)$dispersion*pred.poisq)
print(paste("Estimated overdispersion", sum(res.stq^2)/23))
## [1] "Estimated overdispersion 0.999988649853625"
paste("AIC:", poisson1quasi$aic)
## [1] "AIC: NA"
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(poisson1quasi$null.deviance, deviance(poisson1quasi)), 2))
## [1] "Null deviance:  7859.81"   "Residual deviance: 280.62"

Quasi Poisson with Elapsed time and Age as predictor

#Let's apply a quasi poisson and see what happens
poisson2quasi <- glm(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+`, data=data1, family=quasipoisson)
par(mfrow=c(2,2))
plot(poisson1quasi)

pred.poisq2 <- poisson2quasi$fitted.values
res.stq2 <- (data1$`Cumulative cases` - pred.poisq2)/sqrt(summary(poisson2quasi)$dispersion*pred.poisq2)
print(paste("Estimated overdispersion", sum(res.stq2^2)/18))
## [1] "Estimated overdispersion 0.999984520837828"
paste("AIC:", poisson2quasi$aic)
## [1] "AIC: NA"
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(poisson2quasi$null.deviance, deviance(poisson2quasi)), 2))
## [1] "Null deviance:  7859.81"   "Residual deviance: 200.75"

Negative Binomial with Elapsed time as predictor

nb1 <- glm.nb(`Cumulative cases` ~ `Elapsed time`, data=data1)
par(mfrow=c(2,2))
plot(nb1)

#n=25, k=2, n-k=23
stdres <- rstandard(nb1)
print(paste("Estimated overdispersion", sum(stdres^2)/23))
## [1] "Estimated overdispersion 1.33483954328732"
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(nb1$null.deviance, deviance(nb1)), 2))
## [1] "Null deviance:  477.96"   "Residual deviance: 27.85"

Negative Binomial with Elapsed time plus Age as predictors

nb2 <- glm.nb(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+`, data=data1)
par(mfrow=c(2,2))
plot(nb2)

paste(c("Null deviance: ", "Residual deviance:"),
       round(c(nb2$null.deviance, deviance(nb2)), 2))
## [1] "Null deviance:  526.5"    "Residual deviance: 28.15"

Negative Binomial with Elapsed time plus Department as predictors

nb3 <- glm.nb(`Cumulative cases` ~ `Elapsed time` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá`+`Departamento o Distrito_Caldas`+`Departamento o Distrito_Casanare`+`Departamento o Distrito_Cauca`+`Departamento o Distrito_Cundinamarca`+`Departamento o Distrito_Meta`+`Departamento o Distrito_Quindío`+`Departamento o Distrito_Risaralda`+`Departamento o Distrito_Santander`+`Departamento o Distrito_Tolima`+`Departamento o Distrito_Valle del Cauca`, data=data1)
par(mfrow=c(2,2))
plot(nb3)

paste(c("Null deviance: ", "Residual deviance:"),
       round(c(nb3$null.deviance, deviance(nb3)), 2))
## [1] "Null deviance:  2214.86"  "Residual deviance: 26.19"

Negative Binomial with Elapsed time plus Continent of origin as predictors

nb4 <- glm.nb(`Cumulative cases` ~ `Elapsed time` + `Continente de procedencia_Asia`+`Continente de procedencia_Centroamérica`+`Continente de procedencia_Colombia`+`Continente de procedencia_Europa`+`Continente de procedencia_Norteamérica`+`Continente de procedencia_Sudamerica`, data=data1)
par(mfrow=c(2,2))
plot(nb4)

paste(c("Null deviance: ", "Residual deviance:"),
       round(c(nb4$null.deviance, deviance(nb4)), 2))
## [1] "Null deviance:  864.03"  "Residual deviance: 23.8"

Negative Binomial with Elapsed time, Age and Departments as pedictors

nb5 <- glm.nb(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.`+`Departamento o Distrito_Boyacá`+`Departamento o Distrito_Caldas`+`Departamento o Distrito_Casanare`+`Departamento o Distrito_Cauca`+`Departamento o Distrito_Cundinamarca`+`Departamento o Distrito_Meta`+`Departamento o Distrito_Quindío`+`Departamento o Distrito_Risaralda`+`Departamento o Distrito_Santander`+`Departamento o Distrito_Tolima`+`Departamento o Distrito_Valle del Cauca`, data=data1)
par(mfrow=c(2,2))
plot(nb5)

# Calculating overdispersion n=25 k=19 n-k=6
stdres <- rstandard(nb5)
print(paste("Estimated overdispersion", sum(stdres^2)/6))
## [1] "Estimated overdispersion 3.85395829427517"
paste(c("Null deviance: ", "Residual deviance:"),
       round(c(nb5$null.deviance, deviance(nb5)), 2))
## [1] "Null deviance:  7858.91" "Residual deviance: 3.3"

Applying ANOVA to compare the negative binomial models

We decided to compare nb1, nb2, nb5, because they are nested and we are more interested in seeing if the fifth model is in fact better than the first model.

#Applying ANOVA to compare the negative binomial models
anova(nb1, nb2, nb5)
## Likelihood ratio tests of Negative Binomial Models
## 
## Response: Cumulative cases
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Model
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             Elapsed time
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                     `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+`
## 3 `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + \n    `Departamento o Distrito_Tolima` + `Departamento o Distrito_Valle del Cauca`
##          theta Resid. df    2 x log-lik.   Test    df  LR stat.      Pr(Chi)
## 1 1.125073e+01        23       -253.9080                                    
## 2 1.257832e+01        18       -251.9442 1 vs 2     5  1.963751 8.541378e-01
## 3 2.453680e+06         6       -165.5091 2 vs 3    12 86.435145 2.409184e-13

Predictive accuracy of the Poisson model

Predicting with a \(95\%\) confidence interval

Predictive accuracy of the Negative Binomial model

Predicting with a \(95\%\) confidence interval

The Bayesian approach

Poisson regression

As a first attempt, we fit a simple Poisson regression:

\[ ln(\lambda_i) = \alpha + \beta\cdot elapsed\_time_i \\ y_i \sim \mathcal{Poisson}(\lambda_i) \]

with \(i = 1,\dots,83\), and \(y_i\) represents the number of cases.

Negative binomial model

LOOIC

The Leave-One-Out cross validation is a method for estimating pointwise out-of-sample prediction accuracy from a fitted Bayesian model using the log-likelihood evaluated at the posterior simulation of the parameters values.

Plot the looic to compare models:

loo.all.deps<-c(loo.model.Poisson.complete[3], loo.NB.complete[3], loo.model.NB2.complete[3], loo.model.NB.hier.complete[3])

sort.loo.all.deps<- sort.int(loo.all.deps, index.return = TRUE)$x

par(xaxt="n")
plot(sort.loo.all.deps, type="b", xlab="", ylab="LOOIC", main="All departments")
par(xaxt="s")
axis(1, c(1:4), c("Poisson", "NB-sl", "NB-ml", 
                  "hier")[sort.int(loo.all.deps,
                    index.return = TRUE)$ix],
                    las=2)